Feature Generation for Sequence Categorization

نویسندگان

  • Daniel Kudenko
  • Haym Hirsh
چکیده

The problem of sequence categorization is to generalize from a corpus of labeled sequences procedures for accurately labeling future unlabeled sequences. The choice of representation of sequences can have a major impact on this task, and in the absence of background knowledge a good representation is often not known and straightforward representations are often far from optimal. We propose a feature generation method (called FGEN) that creates Boolean features that check for the presence or absence of heuristically selected collections of subsequences. We show empirically that the representation computed by FGEN improves the accuracy of two commonly used learning systems (C4.5 and Ripper) when the new features are added to existing representations of sequence data. We show the superiority of FGEN across a range of tasks selected from three domains: DNA sequences, Unix command sequences, and English text.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Feature-Based Learners for Description Logics

Inductive learning algorithms that have been applied to learning in description logics (DL) have not been as well studied and optimized as the more general class of feature-based learning algorithms. This paper proposes a way to apply feature-based learners to DL learning tasks by presenting a method to compute a feature-vector representation for DL instances. The representation is based on con...

متن کامل

Improving the Operation of Text Categorization Systems with Selecting Proper Features Based on PSO-LA

With the explosive growth in amount of information, it is highly required to utilize tools and methods in order to search, filter and manage resources. One of the major problems in text classification relates to the high dimensional feature spaces. Therefore, the main goal of text classification is to reduce the dimensionality of features space. There are many feature selection methods. However...

متن کامل

Unsupervised Feature Generation using Knowledge Repositories for Effective Text Categorization

We propose an unsupervised feature generation algorithm using the repositories of human knowledge for effective text categorization. Conventional bag of words (BOW) depends on the presence / absence of keywords to classify the documents. To understand the actual context behind these keywords, we use knowledge concepts / hyperlinks from external knowledge sources through content and structure mi...

متن کامل

An NTU-Approach to Automatic Sentence Extraction for Summary Generation

A B S T R A C T Automatic summarization and information extraction are two important Internet services. MUC and SUMMAC play their appropriate roles in the next generation Internet. This paper focuses on the automatic summarization and proposes two different models to extract sentences for summary generation under two tasks initiated by SUMMAC-1. For categorization task, positive feature vectors...

متن کامل

A feature selection technique for generation of classification committees and its application to categorization of laryngeal images

Article history: Received 25 August 2007 Received in revised form 12 August 2008 Accepted 26 August 2008

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 1998